Random Forest for Gene Expression Based Cancer Classification: Overlooked Issues
Abstract
Random forest is a collection (ensemble) of decision trees and a popular ensemble technique in pattern recognition. In this article, we apply random forest to cancer classification based on gene expression and address two issues that have so far been overlooked in other works. First, we demonstrate on two different real-world datasets that the performance of random forest is strongly influenced by dataset complexity. When estimated before running random forest, this complexity can serve as a useful performance indicator and can explain differences in performance across datasets. Second, we show that feature importance should be used with caution when ranking genes: two forests generated with different numbers of features per node split may have very similar classification errors on the same dataset, yet the corresponding lists of genes ranked by feature importance can be only weakly correlated.
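The second point can be made concrete by comparing two importance-based gene rankings with a rank correlation. The sketch below is not the authors' procedure: it assumes Spearman's coefficient as the agreement measure, and the function names and importance values are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores so the largest gets rank 1; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation between two importance vectors over the same genes."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long vectors with at least two genes")
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical importances for the same four genes from two forests
# (illustrative values, not results from the article).
forest_a = [0.40, 0.30, 0.20, 0.10]
forest_b = [0.10, 0.20, 0.30, 0.40]
print(spearman(forest_a, forest_a))  # identical rankings
print(spearman(forest_a, forest_b))  # fully reversed rankings
```

A coefficient near 1 means the two forests order the genes almost identically; a value near 0 means the rankings are only weakly related, even if both forests classify equally well.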
Similar resources
Classification and Biomarker Genes Selection for Cancer Gene Expression Data Using Random Forest
Background & objective: Microarray and next generation sequencing (NGS) data are important sources for finding helpful molecular patterns. The large volume of gene expression data also makes it more challenging to identify the biomarkers associated with cancer. The random forest (RF) is used to effectively analyze the problems of large-p and smal...
Prediction of blood cancer using leukemia gene expression data and sparsity-based gene selection methods
Background: DNA microarray is a useful technology that simultaneously assesses the expression of thousands of genes. It can be utilized for the detection of cancer types and cancer biomarkers. This study aimed to predict blood cancer using leukemia gene expression data and a robust ℓ2,p-norm sparsity-based gene selection method. Materials and Methods: In this descriptive study, the microarray ...
Clustering of gene expression data using random forest dissimilarity
Background: The clustering of gene expression data plays an important role in the diagnosis and treatment of cancer. These kinds of data typically involve a large number of variables (genes) compared with the number of samples (patients). Many clustering methods have been built on the dissimilarity among observations, which is calculated by a distance function. As increa...
Gene Expression Data Analysis Using Data Mining Algorithms for Colon Cancer
The concept of Data mining is used in various medical applications like tumor classification, protein structure prediction, gene classification, cancer classification based on microarray data, clustering of gene expression data, statistical model of protein-protein interaction etc. Adverse drug events in prediction of medical test effectiveness can be done based on genomics and proteomics throu...
ADABOOST ENSEMBLE ALGORITHMS FOR BREAST CANCER CLASSIFICATION
With advances in technology, different tumor features have been collected for Breast Cancer (BC) diagnosis. Processing such large datasets poses challenges, including high storage requirements and the time required for accessing and processing the data. The objective of this paper is to classify BC based on the extracted tumor features. To extract useful information and diagnose the tumo...